A 300 MB Turkish Corpus and Word Analysis

Internal identifier: 001950 (Main/Exploration); previous: 001949; next: 001951

Authors: Gökhan Dalkilic [Turkey]; Yalcin Cebi [Turkey]

Source:

Lecture Notes in Computer Science [ 0302-9743 ] ; 2002.

RBID : ISTEX:8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290

Abstract

Abstract: In order to determine some properties of a language, a corpus of that language should be created. To analyze the Turkish language, a Turkish corpus of about 300 MB and more than 44 million words was first prepared from 10 different web sites with Turkish content. Statistics for the most frequently used Turkish words were calculated from this corpus. The frequencies of the 7 most frequently used words were compared with those of their English equivalents, and it was found that the most frequently used words in natural languages are not nouns. The most frequently used words of 1 to 5 letters were determined and applied to a randomly selected text in order to test the validity of the process.
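
The counting steps summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the Turkish case folding, and the function names (`turkish_lower`, `tokenize`, `word_frequencies`, `top_by_length`, `coverage`) are all assumptions made for illustration, and the sample strings are toy data, not the 300 MB corpus.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode letters.
WORD_RE = re.compile(r"[^\W\d_]+")


def turkish_lower(text):
    """Lowercase with Turkish dotted/dotless i handled explicitly.

    Python's str.lower() maps "I" to "i", whereas Turkish expects "ı".
    """
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize(text):
    """Split text into lowercase word tokens (hypothetical tokenizer)."""
    return WORD_RE.findall(turkish_lower(text))


def word_frequencies(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))


def top_by_length(freqs, length, n=5):
    """Return the n most frequent words with exactly `length` letters.

    Ties are broken alphabetically so the result is deterministic.
    """
    items = [(w, c) for w, c in freqs.items() if len(w) == length]
    return sorted(items, key=lambda wc: (-wc[1], wc[0]))[:n]


def coverage(frequent_words, text):
    """Fraction of tokens in `text` that belong to `frequent_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in frequent_words)
    return hits / len(tokens)


if __name__ == "__main__":
    sample = "bir ve bu bir de ve bir"  # toy data
    freqs = word_frequencies(sample)
    for length in range(1, 6):  # word lengths 1 to 5, as in the abstract
        print(length, top_by_length(freqs, length))
    print(coverage({"bir", "ve"}, "bir ev ve bir"))
```

The `coverage` function is one possible reading of "applied to a randomly selected text": it measures what share of a test text is accounted for by a chosen set of frequent words. The abstract does not say which measure was actually used.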

Url:

https://api.istex.fr/document/8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290/fulltext/pdf

DOI: 10.1007/3-540-36077-8_20

Affiliations:

Turkey

Links to previous steps (curation, corpus...)

to stream Istex, to step Corpus: 001A17
to stream Istex, to step Curation: 001912
to stream Istex, to step Checkpoint: 001064
to stream Main, to step Merge: 001A30
to stream Main, to step Curation: 001950

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">A 300 MB Turkish Corpus and Word Analysis</title>
<author><name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
</author>
<author><name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290</idno>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1007/3-540-36077-8_20</idno>
<idno type="url">https://api.istex.fr/document/8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001A17</idno>
<idno type="wicri:Area/Istex/Curation">001912</idno>
<idno type="wicri:Area/Istex/Checkpoint">001064</idno>
<idno type="wicri:doubleKey">0302-9743:2002:Dalkilic G:a:mb:turkish</idno>
<idno type="wicri:Area/Main/Merge">001A30</idno>
<idno type="wicri:Area/Main/Curation">001950</idno>
<idno type="wicri:Area/Main/Exploration">001950</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">A 300 MB Turkish Corpus and Word Analysis</title>
<author><name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Computer Engineering Dept., Dokuz Eylul University, 35100, Bornova, Izmir</wicri:regionArea>
<wicri:noRegion>Izmir</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Turquie</country>
</affiliation>
</author>
<author><name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Computer Engineering Dept., Dokuz Eylul University, 35100, Bornova, Izmir</wicri:regionArea>
<wicri:noRegion>Izmir</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Turquie</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2002</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290</idno>
<idno type="DOI">10.1007/3-540-36077-8_20</idno>
<idno type="ChapterID">20</idno>
<idno type="ChapterID">Chap20</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: In order to determine some properties of a language, a corpus of that language should be created. To analyze the Turkish language, a Turkish corpus of about 300 MB and more than 44 million words was first prepared from 10 different web sites with Turkish content. Statistics for the most frequently used Turkish words were calculated from this corpus. The frequencies of the 7 most frequently used words were compared with those of their English equivalents, and it was found that the most frequently used words in natural languages are not nouns. The most frequently used words of 1 to 5 letters were determined and applied to a randomly selected text in order to test the validity of the process.</div>
</front>
</TEI>
<affiliations><list><country><li>Turquie</li>
</country>
</list>
<tree><country name="Turquie"><noRegion><name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
</noRegion>
<name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
<name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
<name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001950 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001950 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290
   |texte=   A 300 MB Turkish Corpus and Word Analysis
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
